Let's start by importing the data into a pandas DataFrame from a CSV file:
In [1]:
import pandas as pd
In [2]:
raw_data = pd.read_csv('datasets/titanic.csv')
raw_data.head()
Out[2]:
In [3]:
raw_data.info()
The information above shows that this dataset contains data on 891 passengers: their names, gender, age, etc. (for a complete description of the meaning of each column, check this link).
In [4]:
# Percentage of missing values in each column
(raw_data.isnull().sum() / len(raw_data)) * 100.0
Out[4]:
It can be seen that 77% of the passengers have no information about which cabin they were allocated to. This information could be useful for further analysis but, for now, let's drop this column:
In [5]:
raw_data.drop('Cabin', axis='columns', inplace=True)
raw_data.info()
The column Embarked, which records the port where each passenger embarked, has only a few missing entries. Since the number of passengers with missing values is negligible, they can be discarded without much harm:
In [6]:
raw_data.dropna(subset=['Embarked'], inplace=True)
(raw_data.isnull().sum() / len(raw_data)) * 100.0
Out[6]:
Finally, the age is missing for around 20% of the passengers. It's not reasonable to drop all of these passengers, nor to drop the column as a whole, so one possible solution is to fill the missing values with the median age of the dataset:
In [7]:
raw_data.fillna({'Age': raw_data.Age.median()}, inplace=True)
(raw_data.isnull().sum() / len(raw_data)) * 100.0
Out[7]:
The median is a robust statistic. A statistic is a number that summarizes a set of values, and it is said to be robust if it is not significantly affected by outliers in the data.
Suppose we have a group of people whose ages are [15, 16, 14, 15, 15, 19, 14, 17]. The average age in this group is 15.625. If an 80-year-old person is added to the group, the average age rises to about 22.78 years, which does not represent the age profile of the group well. The median age, in contrast, is 15 years in both cases - i.e. the median was not changed by the presence of an outlier in the data, which makes it a robust statistic for the ages of the group.
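The arithmetic in this example can be checked with Python's standard `statistics` module. This is only a small sketch of the example above; the variable names are chosen here for illustration and are not part of the original notebook.

```python
from statistics import mean, median

# Ages from the example above, before and after adding the 80-year-old.
ages = [15, 16, 14, 15, 15, 19, 14, 17]
ages_with_outlier = ages + [80]

mean_before, mean_after = mean(ages), mean(ages_with_outlier)
median_before, median_after = median(ages), median(ages_with_outlier)

# The mean shifts noticeably, while the median stays where it was.
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```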
Now that all of the passengers' information has been "cleaned", we can start to analyse the data.
Let's start by exploring how many people in this dataset survived the Titanic:
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
In [9]:
overall_fig = raw_data.Survived.value_counts().plot(kind='bar')
overall_fig.set_xlabel('Survived')
overall_fig.set_ylabel('Amount')
Out[9]:
Overall, 38% of the passengers survived.
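One way such a proportion could be computed: since `Survived` is coded as 0/1, its mean is the share of survivors. The sketch below uses a small made-up DataFrame (the `toy` name and its values are hypothetical, not the Titanic data), so the numbers it yields are unrelated to the 38% quoted above.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic dataset: 1 = survived, 0 = did not.
toy = pd.DataFrame({'Survived': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# The mean of a 0/1 column is the fraction of ones.
survival_rate = toy['Survived'].mean()

# The same information, broken down per class as proportions.
shares = toy['Survived'].value_counts(normalize=True)

print(f"{survival_rate:.0%} survived")
print(shares)
```

Applied to the notebook's data, the same expression would be `raw_data.Survived.mean()`.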
Now, let's segment the proportion of survivors along different profiles (the code to generate the following graphs was taken from this link).
In [10]:
survived_sex = raw_data[raw_data['Survived']==1]['Sex'].value_counts()
dead_sex = raw_data[raw_data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([survived_sex,dead_sex])
df.index = ['Survivors','Non-survivors']
df.plot(kind='bar',stacked=True, figsize=(15,8));
In [11]:
figure = plt.figure(figsize=(15,8))
plt.hist([raw_data[raw_data['Survived']==1]['Age'], raw_data[raw_data['Survived']==0]['Age']],
stacked=True, color=['g','r'],
bins=30, label=['Survivors','Non-survivors'])
plt.xlabel('Age')
plt.ylabel('No. passengers')
plt.legend();
In [12]:
import matplotlib.pyplot as plt
figure = plt.figure(figsize=(15,8))
plt.hist([raw_data[raw_data['Survived']==1]['Fare'], raw_data[raw_data['Survived']==0]['Fare']],
stacked=True, color=['g','r'],
bins=50, label=['Survivors','Non-survivors'])
plt.xlabel('Fare')
plt.ylabel('No. passengers')
plt.legend();
The graphs above indicate that passengers who are female, are under 20 years old and/or paid higher fares had a greater chance of surviving the Titanic (what a surprise!). How precisely can we use this information to predict whether a passenger would survive the accident?
Let's start by keeping only the information that we wish to use - we'll keep the passenger names for further analysis:
In [13]:
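# Take an explicit copy so that later column assignments do not trigger
# SettingWithCopyWarning (the `is_copy` attribute no longer exists in recent pandas).
data_for_prediction = raw_data[['Name', 'Sex', 'Age', 'Fare', 'Survived']].copy()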
data_for_prediction.info()
In [14]:
data_for_prediction['Sex'] = data_for_prediction.Sex.map({'male': 0, 'female': 1})
data_for_prediction.info()
In order to assess the model's predictive power, part of the data (in this case, 25%) must be set aside as a validation set.
A validation set is a dataset for which the expected values are known but that is not used to train the predictive model - this way, the model is not biased by information from these entries, and the set can be used to estimate the error rate.
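To make the idea concrete, a hold-out split can be sketched in plain Python: shuffle the rows with a fixed seed, then cut off a fraction for validation. The `holdout_split` helper below is a hypothetical illustration, not how scikit-learn implements `train_test_split`; details such as rounding and the shuffling algorithm may differ.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle `rows` and split them into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    shuffled = list(rows)                  # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                     # stand-in for 20 passenger records
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

Every record ends up in exactly one of the two lists, which is what keeps the validation entries out of training.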
In [15]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_for_prediction, test_size=0.25, random_state=254)
len(train_data), len(test_data)
Out[15]:
We'll use a simple Decision Tree model to predict whether a passenger would survive the Titanic based on their gender, age and fare.
In [16]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(train_data[['Sex', 'Age', 'Fare']], train_data.Survived)
tree.score(test_data[['Sex', 'Age', 'Fare']], test_data.Survived)
Out[16]:
With a simple decision tree, the result above indicates that it's possible to correctly predict the survival of circa 80% of the passengers.
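For scikit-learn classifiers, `score` returns the mean accuracy: the fraction of samples whose predicted label matches the true one. A minimal sketch of that calculation follows; the `accuracy` helper and the sample labels are made up for illustration and are not taken from the Titanic data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 1 = survived, 0 = did not survive.
actual = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0]

acc = accuracy(actual, predicted)  # 4 of the 5 predictions match
print(f"accuracy: {acc:.0%}")
```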
An interesting exercise to do after training a predictive model is to take a look at the cases where it missed:
In [17]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
test_data = test_data.copy()
test_data['Predicted'] = tree.predict(test_data[['Sex', 'Age', 'Fare']])
test_data[test_data.Predicted != test_data.Survived]
Out[17]:
One example of a wrong prediction above is the passenger named Mrs. Hudson J C Allison, who didn't survive the Titanic despite being female, 25 years old and having paid an expensive fare. A search on Encyclopedia Titanica reveals that, after having been put into a lifeboat, she was told that her son had been placed in another lifeboat on the opposite side of the ship. Mrs. Allison then ran from her boat in an attempt to reach her son, but to no avail.
A particularly interesting collection of stories related to the Titanic passengers can be found in this post.